Abstract

We study human dynamics by analyzing Linux history files. The goodness-of-fit test 
shows that most of the collected datasets belong to the universality class suggested 
in the literature by a variable-length queueing process based on priority. In order 
to check the validity of this model, we design two tests based on mutual informa- 
tion between time intervals and a mathematical relationship known as the arcsine 
law. Since the previously suggested queueing process fails to pass these tests, the 
result suggests that the modelling of human dynamics should properly consider the 
statistical dependency in the temporal dimension. 

1 Introduction

Recently, there have been various attempts to characterize human behaviors
in the mathematical terms that have been successfully applied to natural
phenomena. To name a few, human activities such as the Internet, traffic flows,
family names, and stock prices are under active investigation and give deep
insights into our society [1,2,3,4,5,6]. We now even have a popularized term,
'human dynamics' [7], and many researchers are devoting themselves to this
field. One of their most surprising claims is that there exist a few
universality classes in human dynamics [8,9,10]. Those universality classes are
described by a priority-based queueing process, which yields power-law
waiting-time distributions $p(\tau) \sim \tau^{-\alpha}$ with universal
exponents of $\alpha = 1.0$, $1.5$, and $2.5$ [11]. This idea has attracted
great public attention due to its philosophical implications, which run against
our conventional beliefs about the human condition, and there have also been
intensive scientific debates on the observations [12] and on the existence of
universality classes [32]. Even if one may doubt the validity of the
universality claim, the original observation truly pointed out some fundamental
properties of human behaviors and contributed a lot to this field by proposing
a powerful and falsifiable model. To our knowledge, only a few models are yet
to undergo closer examinations, including those in Refs. [14,15].

In this article, we analyze human behaviors through Linux history files, which
contain the histories of every shell command entered at terminals. Unlike the
records from supercomputers [16] and personal computers including mouse
movements [17], our observation partially supports the universality claim in
that most of the collected distributions fall into the suggested universality
class with $\alpha = 1.5$. Since this is the regime where a priority-based
queue model works with a varying queue length [5], we may imagine that a person
works as the model describes, where each command executed on the shell
introduces the next command to her queue with a randomly assigned priority. The
waiting time before execution is essentially dominated by a random walk of the
queue length, which gives the desired power-law distribution with
$\alpha = 1.5$ [18]. However, the overall distribution carries only a small
amount of information, and it is quite possible to devise further examinations
to compare our empirical data with the suggested model. In other words, if
command inputs can be described by the queue model, which reduces to a
one-dimensional random walker, such a simple and rigorous mechanism should put
some explicitly testable constraints on the result. For example, a natural
requirement is that the time intervals between consecutive events must be
mutually independent. That is our motivation to design two tests, based
respectively on the correlation between time intervals and on the
characteristic hitting-time distribution of a regenerative process [19]. These
tests show that our observations are not fully explained by the existing queue
models.

This article is organized as follows: In Sec. 2, we explain how we prepared
datasets and present their basic statistical features. In Sec. 3, the
goodness-of-fit test for verifying power-law behaviors is followed by two tests
to examine the priority-based queue model. Then we discuss the implications of
the test results in Sec. 4 and conclude this work in Sec. 5.



2 Data Collection 

A Linux system usually keeps every user's shell command history up to some
predefined length. In Bash (Bourne-again shell), each shell command can be
accompanied by a time-stamp if we add a couple of lines to a resource file
called '.bashrc' as in Fig. 1(a), where the first line defines the maximal
history length and the second adds time-stamps. Then a user's typical history
is written as in Fig. 1(b), with the numbers indicating time-stamps in units
of seconds. We collect seven history files from six users (including two
authors of the present paper), each of which is labeled by a letter from A to
G. Since they worked without any explicit coordination during the recording
period, we regard these records as mainly reflecting their individual
characteristics. Note that the history files may not be arranged in
chronological order when a user uses multiple terminals, so we have to sort the
datasets before analysis. In addition, we generate one more dataset R,
recording the return times of a one-dimensional random walker to the origin,
which will function as a control group throughout our analyses. Fig. 2 shows
the input rates for the datasets A and R by counting the number of events in
every hour, or in $60^2 = 3600$ time units (seconds).

Letting $t_i$ indicate the time when the $i$th command was entered, we define
the waiting time between two consecutive inputs as

$$\tau_i \equiv t_{i+1} - t_i.$$

Conversely, if there are $n$ command inputs recorded in the file,
$T = \sum_{i=1}^{n} \tau_i$ is the total time interval. There are two other
characteristic quantities, $\bar{\tau} = T/n$ and
$\tau_{\max} = \max_i \{\tau_i\}$, although we will see later that $\bar{\tau}$
is not a good statistic in that it is essentially sample-dependent here. Those
values for each dataset are listed in Table 1.
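
As an illustration of the preprocessing described above, the following is a minimal sketch that reads a time-stamped Bash history, sorts the entries chronologically, and computes the waiting times. The function names `parse_history` and `waiting_times` and the three-command sample are hypothetical; the sketch also assumes that every command occupies a single line, which need not hold for real history files.

```python
from typing import List, Tuple

def parse_history(lines: List[str]) -> List[Tuple[int, str]]:
    """Pair every '#<epoch seconds>' line with the command line that follows it."""
    events = []
    stamp = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("#") and line[1:].isdigit():
            stamp = int(line[1:])
        elif stamp is not None and line:
            events.append((stamp, line))
            stamp = None
    # Entries written from several terminals may be out of order: sort by time.
    events.sort(key=lambda e: e[0])
    return events

def waiting_times(events: List[Tuple[int, str]]) -> List[int]:
    """Differences between consecutive time-stamps, in seconds."""
    times = [t for t, _ in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Hypothetical sample, deliberately out of chronological order.
sample = ["#1165822917", "cd ..", "#1165822911", "vi a.txt", "#1165822950", "ls"]
events = parse_history(sample)
taus = waiting_times(events)
tau_max = max(taus)
```

For this sample, the sorted time-stamps are 1165822911, 1165822917, and 1165822950, so the waiting times are 6 s and 33 s.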

Furthermore, one may be interested in the transition between commands.
When a command $c_1$ is followed by a command $c_2$ in the shell, we call it a
transition from $c_1$ to $c_2$. The transition patterns are easily visualized
by a network, as shown in Fig. 3 by nodes (commands) and links (transitions).
As the largest frequency is found in the transition from 'ls' to 'cd', only
those links are depicted whose transition frequencies exceed 2% of it, together
with the major commands involved in these transitions.
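
The counting behind such a transition network can be sketched as follows. This is only an illustration under stated assumptions: a command is identified by the first word of its input line, self-loops are dropped as in Fig. 3, and the names `transition_counts` and `major_links` as well as the sample inputs are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Link = Tuple[str, str]

def transition_counts(commands: List[str]) -> Counter:
    """Count transitions between consecutive command names, omitting self-loops."""
    names = [c.split()[0] for c in commands if c.split()]
    return Counter((a, b) for a, b in zip(names, names[1:]) if a != b)

def major_links(counts: Counter, fraction: float = 0.02) -> Dict[Link, int]:
    """Keep links whose frequency exceeds `fraction` of the largest frequency."""
    if not counts:
        return {}
    top = max(counts.values())
    return {link: k for link, k in counts.items() if k > fraction * top}

# Hypothetical command sequence.
commands = ["ls -l", "cd src", "ls", "cd ..", "vi a.txt", "ls"]
counts = transition_counts(commands)
links = major_links(counts, fraction=0.6)
```

With the default `fraction=0.02`, the selection corresponds to the 2% threshold mentioned in the text; the larger value in the example is chosen only so that the tiny sample is filtered at all.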



3 Data Analysis 

3.1 Waiting-time distribution 

Let us consider the probability distribution $p(\tau)$, obtained from an
empirical dataset, $\{\tau_i\}$. For convenience, we are going to work with its
derived form, the cumulative distribution function defined as follows [18]:

$$P(\tau) = \int_{\tau}^{\infty} p(\tau') \, d\tau'.$$
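
For a finite sample, this cumulative distribution can be estimated by the fraction of waiting times that are not smaller than a given value. The sketch below assumes NumPy and the convention "greater than or equal to"; the function name `empirical_ccdf` is hypothetical.

```python
import numpy as np

def empirical_ccdf(taus):
    """Return the sorted unique values x and the fraction of samples >= x."""
    data = np.sort(np.asarray(taus, dtype=float))
    # For sorted data, the index of the first occurrence of x equals the
    # number of samples strictly smaller than x.
    x, first_index = np.unique(data, return_index=True)
    p = 1.0 - first_index / data.size
    return x, p

x, p = empirical_ccdf([1, 1, 2, 4])
```

Plotting `p` against `x` on doubly logarithmic axes gives a straight line with slope $-(\alpha - 1)$ when the data follow a power law.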



Fig. 4 displays $P(\tau)$ for our eight datasets. The distribution function in
Fig. 4(d) looks far from a straight line, presumably because this user prefers
using his own personal Linux machine to accessing the remote server where
the recording has been carried out. In comparison, the distribution in
Fig. 4(e), which is for the same user as in (d) (he submitted two history
files, D from a remote server and E from his own local desktop), does not show
any significant difference from other datasets. It evidently shows an effect
of individual characteristics, and the concave shape is reminiscent of the job
submission interval in supercomputers [16]. Nevertheless, if we exclude the
dataset D, the qualitative behaviors are surprisingly similar to each other.

Every arrow in Fig. 4 indicates the point at 4.32 x 10^4 s, or 12 h. The
existence of a hump for each individual appears to reflect his daily life
cycle. The Fourier transform also confirms that the working pattern is quite
regular, as two peaks are prominent at one day and one week (Fig. 5). Shown
differently, the autocorrelation function from the input rate [2] exhibits
oscillatory patterns with periods around 24 h (Fig. 6). All of these indicate
the existence of long-term orders.
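
The periodicity checks mentioned above can be sketched as follows, assuming NumPy. The function names, the hourly binning helper, and the synthetic rate with a 24 h period are hypothetical illustrations, not the procedure actually used for Figs. 5 and 6.

```python
import numpy as np

def hourly_rate(times):
    """Count events in consecutive one-hour (3600 s) bins."""
    t = np.asarray(times, dtype=np.int64)
    return np.bincount((t - t.min()) // 3600)

def power_spectrum(rate):
    """Return frequencies (cycles per hour) and spectral power of the rate."""
    x = np.asarray(rate, dtype=float)
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0)
    return freqs, power

def autocorrelation(rate, max_lag):
    """Normalized autocorrelation of the rate for lags 0..max_lag (in hours)."""
    x = np.asarray(rate, dtype=float)
    x = x - x.mean()
    var = np.mean(x * x)
    return np.array([np.mean(x[: x.size - j] * x[j:]) / var
                     for j in range(max_lag + 1)])

# Synthetic input rate over 30 days with a daily cycle.
hours = np.arange(24 * 30)
rate = 10.0 + 5.0 * np.cos(2.0 * np.pi * hours / 24.0)
freqs, power = power_spectrum(rate)
peak_period = 1.0 / freqs[np.argmax(power)]   # in hours
acf = autocorrelation(rate, 24)
```

For the synthetic rate, the spectral peak sits at a period of 24 h and the autocorrelation returns to its maximum at a lag of 24 h.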

We fit each dataset using a power-law function $p(\tau) \sim \tau^{-\alpha}$
with an appropriate lower bound $\tau_{\min}$, where the number of data points
larger than or equal to $\tau_{\min}$ is denoted as $n_{\mathrm{tail}}$. The
optimal parameter values are listed in Table 2. We apply the goodness-of-fit
test based on the Kolmogorov-Smirnov (KS) statistic and measure the p-value,
which quantifies how plausible it is that a dataset was drawn from the
hypothesized distribution [20]. As shown by the p-values in Table 2, the
power-law function is found to be at least a moderate description for all the
datasets except D. The humps due to long-term orders do little harm to the
test, because the KS test tends to be sensitive to deviations in the body of
the distribution rather than to those in the tail, which occur on a much
smaller scale. We do not treat the model selection problem [20] here, but it
would not be a big surprise if they converge to the Lévy stable distribution in
the long run by the generalized Central Limit Theorem.
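
To make the fitting step concrete, here is a minimal sketch of the maximum-likelihood exponent and the KS distance for a given lower bound. It assumes NumPy and uses the continuous approximation of the power law, whereas the recorded intervals are integers; the search for the optimal lower bound and the computation of the p-value are omitted. The function names and the synthetic sample are hypothetical.

```python
import numpy as np

def fit_alpha(taus, tau_min):
    """Continuous maximum-likelihood exponent for the tail tau >= tau_min."""
    tail = np.asarray(taus, dtype=float)
    tail = tail[tail >= tau_min]
    alpha = 1.0 + tail.size / np.sum(np.log(tail / tau_min))
    return alpha, tail.size

def ks_distance(taus, tau_min, alpha):
    """Largest gap between the empirical and the fitted model CDF of the tail."""
    tail = np.sort(np.asarray(taus, dtype=float))
    tail = tail[tail >= tau_min]
    n = tail.size
    model = 1.0 - (tail / tau_min) ** (1.0 - alpha)   # model CDF at each point
    upper = np.arange(1, n + 1) / n                   # empirical CDF just after
    lower = np.arange(0, n) / n                       # empirical CDF just before
    return max(np.max(upper - model), np.max(model - lower))

# Synthetic power-law sample with alpha = 1.5 and lower bound 1.
rng = np.random.default_rng(0)
sample = (1.0 - rng.random(20_000)) ** (-1.0 / (1.5 - 1.0))
alpha_hat, n_tail = fit_alpha(sample, tau_min=1.0)
d_ks = ks_distance(sample, 1.0, alpha_hat)
```

In the approach of Ref. [20], the p-value is then obtained by repeating the same fit on many synthetic samples and counting how often their KS distance exceeds the empirical one.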

3.2 Mutual information



To further analyze behavioral patterns, some authors employ the conventional
autocorrelation function [21]:

$$C(j) = \frac{1}{n-j} \sum_{i=1}^{n-j} (\tau_i - \bar{\tau})(\tau_{i+j} - \bar{\tau}).$$

Note that this function assumes the well-behavedness of statistical moments
such as the average $\bar{\tau}$. None of them are well defined for power-law
distributions with $\alpha < 2$ [18], and the large variation of $\bar{\tau}$
in Table 1 implies this deficiency.

We next try to devise an alternative measure for the correlation, which should
be zero for perfectly uncorrelated data. A possible trick is inverting the
generation algorithm for power-law distributed random numbers: If $r$ is a
random number uniformly drawn from $[0, 1)$, the formula given as

$$x = x_{\min} (1 - r)^{-1/(\alpha - 1)}$$

makes a power-law distribution $p(x) \sim x^{-\alpha}$ with a lower bound
$x_{\min}$ [20].
Therefore, if $\{x_i\}$ is a set of power-law distributed random numbers, the
inverse transformation

$$r_i = 1 - \left( \frac{x_i}{x_{\min}} \right)^{-(\alpha - 1)}$$

will generate a set of random numbers uniformly distributed over $[0, 1)$.
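
The pair of transformations can be checked numerically with a short sketch, assuming NumPy; the function names `to_power_law` and `to_uniform` and the parameter values are hypothetical.

```python
import numpy as np

def to_power_law(r, x_min, alpha):
    """Map uniform numbers r in [0, 1) to power-law numbers x >= x_min."""
    return x_min * (1.0 - r) ** (-1.0 / (alpha - 1.0))

def to_uniform(x, x_min, alpha):
    """Inverse map: power-law numbers x >= x_min back to [0, 1)."""
    return 1.0 - (x / x_min) ** (-(alpha - 1.0))

rng = np.random.default_rng(0)
r = rng.random(10_000)
x = to_power_law(r, x_min=36.0, alpha=1.5)
r_back = to_uniform(x, x_min=36.0, alpha=1.5)
```

Applying the inverse map with the fitted exponent and lower bound therefore flattens a power-law distributed dataset into an approximately uniform one, which is the form used below.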
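In the discrete case as ours, each $x_i$ is not mapped to a unique point, but
to a set of points ranging over

$$1 - \left( \frac{x_i}{x_{\min}} \right)^{1-\alpha} \le r < 1 - \left( \frac{x_i + 1}{x_{\min}} \right)^{1-\alpha}.$$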



Since every number within this range is equivalent, a reasonable choice is to
draw a point $r_i$ randomly within the interval. This indeterminacy causes some
fluctuations in the final result, but the trick still works, giving us
consistent estimates. The mutual information between consecutive points is then
calculated as [22]

$$I(r_{i+1}, r_i) = \iint p(r_{i+1}, r_i) \log \frac{p(r_{i+1}, r_i)}{p(r_{i+1}) \, p(r_i)} \, dr_{i+1} \, dr_i, \tag{1}$$

where $p(r_{i+1}, r_i)$ means the joint probability density function of
$r_{i+1}$ and $r_i$. Accordingly, if $r_i$ and $r_{i+1}$ are completely
uncorrelated, $I$ takes the null value.

Applying this transformation to each collected dataset, $\{\tau_i\}$, we obtain
the transformed result, $\{u_i\}$, whose number of elements is
$n_{\mathrm{tail}}$. Introducing $H_1 = -\sum p(u_i) \log p(u_i)$ and
$H_2 = -\sum p(u_{i+1}, u_i) \log p(u_{i+1}, u_i)$ as the entropy of $\{u_i\}$
and the joint entropy of $\{u_{i+1}, u_i\}$, respectively, we rewrite Eq. (1)
as follows:

$$I(u_{i+1}, u_i) = 2 H_1 - H_2.$$

Normalizing this with respect to the entropy, we get the following quantity to
measure how much correlation a dataset contains:

$$h \equiv \frac{I(u_{i+1}, u_i)}{H_1} = 2 - \frac{H_2}{H_1}.$$



In practice, $p(u) \, du$ is estimated by the number of data points in
$[u, u + du)$, and the values of entropies are dependent on the choice of $du$,
or equivalently, the number of bins in making a histogram. We choose Sturges'
formula to determine the optimal number of bins [23]:

$$k = \lceil \log_2 n_{\mathrm{tail}} + 1 \rceil,$$

where $\lceil x \rceil$ means the ceiling function of $x$. The results are
shown in Fig. 7. Every $h$ lies at around 1%, which is not much larger than our
expectation. However, we have to check if those values are small enough to
conclude that the data are indeed uncorrelated. A common technique to find a
reference point is by using surrogates [22]: To destroy all the possible
correlation without altering the overall distribution, it suffices to perform a
random shuffle on the data. Then we calculate the mutual information from an
ensemble of such surrogate datasets. As shown in Fig. 7, while this method
makes little difference for R, every human dataset is found to carry mutual
information to a significant degree. This implies that, from the viewpoint of
mutual information, our datasets differ from what the previously proposed
priority-based queue model predicts.
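
The estimation and the surrogate comparison can be sketched as follows, assuming NumPy and an input sequence already transformed to $[0, 1)$. Normalizing the mutual information by the single-variable entropy, the number of surrogates, and all function names are assumptions made for illustration.

```python
import math
import numpy as np

def sturges_bins(n):
    """Sturges' formula: ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)

def entropy(counts):
    """Shannon entropy (in nats) of a histogram given by its bin counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def normalized_mi(u, bins=None):
    """Mutual information 2*H1 - H2 of consecutive points, divided by H1."""
    u = np.asarray(u, dtype=float)
    k = bins or sturges_bins(u.size)
    edges = np.linspace(0.0, 1.0, k + 1)
    h1 = entropy(np.histogram(u, bins=edges)[0].astype(float))
    joint = np.histogram2d(u[1:], u[:-1], bins=[edges, edges])[0]
    h2 = entropy(joint.ravel())
    return (2.0 * h1 - h2) / h1

def surrogate_mi(u, n_surrogates=100, seed=0):
    """Mean and standard deviation of the measure over shuffled copies of u."""
    rng = np.random.default_rng(seed)
    vals = [normalized_mi(rng.permutation(u)) for _ in range(n_surrogates)]
    return float(np.mean(vals)), float(np.std(vals))
```

A dataset would then be regarded as correlated when `normalized_mi(u)` lies well above the mean returned by `surrogate_mi(u)`, measured in units of the surrogate standard deviation.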

Before proceeding, however, some subtlety should be mentioned: Since this
test simply neglects all the $\tau_i$'s smaller than $\tau_{\min}$, some pairs
of $(u_i, u_{i+1})$ may not come from really consecutive time intervals. If we
further require such consecutiveness, the number of available data pairs
becomes even less than $n_{\mathrm{tail}}$, making the results unclear for some
datasets. Therefore, even though we could reveal some quantitative differences,
they are not so conclusive, as the constraint $\tau \ge \tau_{\min}$ and the
indeterminacy for a discrete case severely weaken the power of this test. It is
for this reason that we propose another test, taking all the data points into
consideration.



3.3 Arcsine test 

Suppose that a one-dimensional random walker in the x-direction starts at
time $t = 0$ from the origin at $x = 0$. It is then mathematically proved that
the probability $f$ for the walker to hit the origin ($x = 0$) in the time
interval $(\xi t, t)$ with $0 < \xi < 1$ is given by [19]

$$f(\xi) = 1 - \frac{2}{\pi} \arcsin \sqrt{\xi} \tag{2}$$

as $t \to \infty$. Let us check how far our datasets are from this result. For
a given $t$, we can estimate the probability function, $\eta = f_e(\xi)$, from
our datasets with varying $\xi$. It is a monotonically decreasing function by
definition, and the following KS statistic will properly quantify the maximum
deviation [24]:

$$d = \max_{\eta} \left| f^{-1}(\eta) - f_e^{-1}(\eta) \right|. \tag{3}$$

Note that since each measured point should contribute an equal amount in the
KS test, the inverse functions are more appropriate. Due to the assumption
$t \to \infty$ on which Eq. (2) is based, we observe how the statistic $d$
behaves as $t$ increases (Fig. 8). We only use times $t$ less than 10% of the
total recording period $T$ in order to avoid effects caused by the finiteness
of the recording. As clearly shown in Fig. 8, only the dataset R maintains low
$d$ at large $t$. Moreover, one may easily calculate the significance level
from the fact that $f_e(\xi)$ is constructed with the effective number of
points $N_e = 49$ (see Ref. [21] for details of the KS test): The dataset R has
$d \approx 0.12$ at $t \approx 0.1T$, which amounts to the significance level
of $Q_{KS} \approx 46\%$. Even if this does not satisfy the usual requirement
like 95% significance level, it is still remarkable since we observe that every
human dataset has $Q_{KS} < 10^{-20}$ under the same condition. Consequently,
it is very plausible that the human dynamics in our datasets needs a
description beyond a simple random walker, which is also supported by the
previous test based on the mutual information.
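
The arcsine law itself can be checked numerically on simulated walks. The sketch below, assuming NumPy, estimates the hitting probability from an ensemble of independent walks and evaluates the deviation between the inverse functions on a grid of 49 values of $\xi$. It illustrates the quantities involved rather than the estimation procedure applied to the recorded event times; the ensemble size, the walk length, the grid, and the function names are hypothetical choices.

```python
import numpy as np

def arcsine_hit_probability(xi):
    """Limiting probability of visiting the origin within (xi * t, t)."""
    return 1.0 - (2.0 / np.pi) * np.arcsin(np.sqrt(xi))

def simulated_hit_probability(t, xis, n_walks=8000, seed=0):
    """Fraction of simple random walks of t steps that visit 0 in (xi*t, t]."""
    rng = np.random.default_rng(seed)
    steps = rng.choice(np.array([-1, 1]), size=(n_walks, t))
    at_origin = np.cumsum(steps, axis=1) == 0       # column j is time j + 1
    out = []
    for xi in xis:
        start = int(round(xi * t))                  # first column with time > xi*t
        out.append(at_origin[:, start:].any(axis=1).mean())
    return np.array(out)

def ks_deviation(xis, f_e):
    """Maximum distance between the inverse functions, evaluated at eta = f_e(xi)."""
    # At eta = f_e(xi) the empirical inverse equals xi, while the inverse of
    # the arcsine law is sin^2(pi * (1 - eta) / 2).
    return np.max(np.abs(np.sin(0.5 * np.pi * (1.0 - f_e)) ** 2 - xis))

xis = np.arange(1, 50) / 50.0                       # 49 grid points
f_e = simulated_hit_probability(400, xis)
d = ks_deviation(xis, f_e)
```

For independent simple random walks, `d` stays small, whereas a process with correlated intervals is expected to produce a larger deviation.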



4 Discussion 

One interesting point in our observation is that the datasets show
heavy-tailed distributions up to some cutoffs and, at the same time, quite
regular behaviors. This is seemingly contradictory, as a power-law distribution
with an exponent $\alpha < 2$ is known to make its average and variance
diverge, lacking any characteristic time scales.

Even though one may say that the irregularity still exists in intra-day scales
at least (see Fig. 4), it is true that a power-law distribution does not
necessarily imply highly complicated dynamics. Let us consider a very simple
example with $N$ persons, each of whom has her own working frequency,
$f_i$ $(i = 1, \ldots, N)$. If these frequencies are uniformly distributed over
a rather wide range of time scales, the collected set of waiting times will
have a power-law distribution. Namely, each person's own time interval is
simply the inverse of her frequency, $\tau_i = 1/f_i$, and should have the
following distribution [18]:

$$p(\tau) = p(f) \left| \frac{df}{d\tau} \right| \sim \tau^{-2}.$$

Even a single person may have multiple working phases, each of which requires
a different frequency of inputs but occupies roughly the same time as other
phases. Indeed, the exponent $\alpha \approx 2$ is already reported in
Refs. [13,11], and modulating the exponent is not impossible because any random
parts in fragmenting time schedules are basically a multiplicative process
yielding power-law or log-normal distributions [25].
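
The simple example with uniformly distributed frequencies can be verified with a few lines, assuming NumPy. Frequencies are drawn uniformly from $(0, 1]$ in arbitrary units, so that the intervals satisfy $\tau \ge 1$; the sample size and the use of the continuous maximum-likelihood estimator are choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
freqs = 1.0 - rng.random(100_000)      # uniform on (0, 1]
taus = 1.0 / freqs                     # one interval per person, tau >= 1

# Continuous maximum-likelihood exponent with lower bound tau_min = 1.
alpha_hat = 1.0 + taus.size / np.sum(np.log(taus))

# Fraction of intervals exceeding 10; the exact value is P(f < 0.1) = 0.1.
tail_fraction = np.mean(taus > 10.0)
```

The estimated exponent comes out close to 2, in line with the change-of-variables argument.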

This is an illustrative, if not serious, example to show that there may be
a number of competing theories, all of which yield power-law distributions
while being totally different in other respects [18]. We have thus focused on
consistency checks for a previously suggested model, while leaving the
elaboration of a new one for future work. We stress that rejecting the
existing queue model in our case does not mean that it is wrong or useless.
Rather, our study shows one of its greatest virtues, i.e., the openness to a
variety of challenges. Therefore, the queueing scheme is still a good starting
point to consider human behaviors at the first approximation in a variety of
situations, as long as one keeps in mind how a current simple model may deviate
from reality.



5 Conclusion 

We collected human behavioral patterns from Linux history files and found 
that their waiting-time distributions followed power-laws. Since they seemed 
to fall into the previously claimed universality class, characterized by an ex- 
ponent of 1.5, we tested the corresponding priority-based queue model by two 
measures. The first test was based on the mutual information, while the sec- 
ond was on the arcsine law in a regenerative process. Both tests indicated that 
our datasets had significant differences from what the model of our concern
predicted. This implies that we should also consider the temporal relations in 
order to find an accurate description of human behaviors. 

(a)
export HISTSIZE=1000000
export HISTTIMEFORMAT=%F\ %T

(b)
#1165822911
vi a.txt
#1165822917
cd ..

Fig. 1. (a) Two lines added to .bashrc to record time-stamps up to 10^6 lines.
(b) A typical look at the resulting history file, which contains one time-stamp
above every command input to the Linux shell.



Table 1

Basic quantities: Datasets were collected from seven history files (A-G) and R
was generated by a simple computer program simulating a one-dimensional random
walk. The columns titled $n$ and $T$ are the number of recorded commands and
the total recording period, respectively. The third column gives the average
time interval $\bar{\tau} = T/n$, while the last one shows the maximal interval
in each dataset. Times are in seconds for A-G and in time steps for R.

Dataset   n            T            \bar{\tau}   \tau_max
A         9.3 x 10^4   2.9 x 10^7   3.1 x 10^2   1.1 x 10^6
B         2.2 x 10^4   2.9 x 10^7   1.3 x 10^3   1.0 x 10^6
C         1.3 x 10^4   2.9 x 10^7   2.1 x 10^3   2.4 x 10^6
D         3.0 x 10^4   2.7 x 10^7   8.9 x 10^2   1.4 x 10^6
E         4.8 x 10^4   2.1 x 10^7   4.4 x 10^2   5.2 x 10^5
F         1.6 x 10^4   2.9 x 10^7   1.8 x 10^3   1.8 x 10^6
G         1.2 x 10^4   1.5 x 10^7   1.2 x 10^3   1.2 x 10^6
R         2.0 x 10^4   1.3 x 10^8   6.3 x 10^3   3.0 x 10^7

Fig. 2. (a) Command input rate of the dataset A, measured by the number of
inputs in every hour (= 3.6 x 10^3 s). (b) A one-dimensional random walker's
number of returns to the origin in every hour (= 3.6 x 10^3 time steps), from
the dataset R. (Horizontal axis: hour.)

Fig. 3. Command input pattern of the dataset A. Self-loops are omitted, and the
dashed edges without arrows represent the transitions in both directions. The
commands named 'wsub' and 'wstat' are not supported by ordinary Linux shells
but specific to this Linux machine, while 'exe' indicates all the
user-generated executable files.

Fig. 4. Cumulative distributions of waiting times in collected datasets. The
panels from (a) to (g) correspond to the empirical datasets from A to G,
respectively, while the panel (h) indicates the dataset R from a random walker.
Each arrow indicates $\tau = 12$ h.

Fig. 5. Fourier transformation of input rates in the dataset A.

Fig. 6. (Color online) Autocorrelation of input rates in each dataset. We
depict only four datasets which clearly show oscillatory patterns with a period
of 24 h.

Fig. 7. Mutual information between consecutive time intervals for the observed
datasets and their surrogates. (Horizontal axis: datasets A-G and R; legend:
observed, surrogate.)



Fig. 8. (Color online) Deviation $d$ between the estimated probability function
and the arcsine law [see Eq. (3) and text] as $t$ increases up to 10% of the
total recording period, $T$. Except for R from the random walk, all datasets
are shown not to converge to the arcsine law. (Horizontal axis: $t/T$.)

Table 2

Results of the goodness-of-fit test: Each dataset is fitted to the power-law
distribution with a lower bound $\tau_{\min}$ and an exponent $\alpha$, and the
fit is assessed by the KS test. The number of points satisfying
$\tau_i \ge \tau_{\min}$ is denoted as $n_{\mathrm{tail}}$, and a larger
p-value means that the power-law hypothesis is more plausible.

Dataset   \tau_min   \alpha   n_tail       p-value
A         60         1.74     1.1 x 10^4   0.47
B         24         1.47     4.4 x 10^3   0.34
C         32         1.50     3.1 x 10^3   0.61
D         13         1.57     4.8 x 10^3   0.00
E         174        1.62     3.5 x 10^3   0.62
F         25         1.48     3.9 x 10^3   0.22
G         26         1.52     2.2 x 10^3   0.20
R         36         1.50     1.9 x 10^3   0.84